The theoretical distribution of the median is often not very friendly. To get a handle on what the distribution looks like for samples from different population distributions, we can use Efron’s bootstrap.
We will make ample use of the sample function. This function returns either a random subset of the original data, or a permutation — depending upon the number of items that you request it returns.
x <- 1:10
x
## [1] 1 2 3 4 5 6 7 8 9 10
### Get a random sample of size 5 from the n=10 w/o replacement
sample(x, 5)
## [1] 9 4 10 5 3
### Get a random permutation of the n=10
sample(x)
## [1] 6 4 10 5 2 8 9 7 1 3
### Get a bootstrap sample (of size n from the n=10 w/ replacement)
bootsamp <- sample(x, replace = TRUE)
bootsamp
## [1] 4 4 7 7 5 3 2 4 10 6
The default is to have each item have a selection probability of 1/n. We can specify other probabilities if we so choose.
### Creatae a boostrap sample from x with probabilities that favor the first five elements
prob1 <- c(rep(.15, 5), rep(.05, 5))
prob1
## [1] 0.15 0.15 0.15 0.15 0.15 0.05 0.05 0.05 0.05 0.05
sample(10, replace = TRUE, prob=prob1)
## [1] 6 4 5 4 5 7 4 2 2 5
Sampling from a matrix is done similarly. We can sample from all elements of the matrix.
### Create a matrix filled with random values
y1 <- matrix( round(rnorm(25,5)), ncol=5)
y1
## [,1] [,2] [,3] [,4] [,5]
## [1,] 5 5 4 3 6
## [2,] 4 3 4 4 5
## [3,] 5 5 3 4 6
## [4,] 4 5 5 3 6
## [5,] 5 8 5 5 5
### Save a sample of size 5 in the vector x1
x1 <- y1[sample(25, 5)]
x1
## [1] 4 5 3 8 5
Or, we can sample entire rows.
### Create a matrix filled with random values
y2 <- matrix( round(rnorm(40, 5)), ncol=5)
y2
## [,1] [,2] [,3] [,4] [,5]
## [1,] 4 5 5 6 6
## [2,] 6 6 3 2 6
## [3,] 4 4 5 4 5
## [4,] 5 5 5 6 2
## [5,] 5 4 5 6 4
## [6,] 6 4 3 5 5
## [7,] 3 6 2 5 5
## [8,] 6 6 4 6 6
### Save a sample of rows from y2 in the matrix x2
x2 <- y2[sample(8, 3), ]
x2
## [,1] [,2] [,3] [,4] [,5]
## [1,] 3 6 2 5 5
## [2,] 6 4 3 5 5
## [3,] 6 6 3 2 6
In the following example we find an estimate of the standard error for the estimate of the median. We will be using the lapply and sapply functions in combination with the sample function. (More information about the lapply and sapply functions can be found in the R help pages.)
### Create a data set by taking 100 random observations from a normal
### distribution with mean 5 and stdev 3. Each observation has been
### rounded to the nearest integer.
data <- round(rnorm(100, 5, 3))
### Display the first ten observations
data[1:10]
## [1] 4 8 8 5 2 4 8 6 2 8
The lapply function applies a given function to a list by passing the list elements to the function. We use lapply to run a function that generates a sample of the same size as the original sample by sampling with replacement.
### Generate 25 bootstrap samples. Note that the function has an argument i
### that is not passed to the sample function which uses the global
### data instead.
resamples <- lapply(1:25, function(i) sample(data, replace = T))
### Display the results for the third resample --- the others are similar.
resamples[3]
## [[1]]
## [1] 4 6 5 5 7 3 9 5 1 6 5 4 6 6 4 5 5 1 9 8 5 7 1 6 7 6 4 5 4 6 8 6 2 5 6 4 2
## [38] 4 8 6 2 4 4 4 4 4 4 6 6 4 8 6 3 9 6 4 3 6 1 6 0 6 6 8 5 5 7 2 5 5 9 8 7 4
## [75] 1 5 2 4 7 4 2 3 4 6 6 9 4 1 6 6 4 3 5 4 6 5 5 7 6 7
We can now apply the median function to each of the bootstrap samples to generate a bootstrap distribution.
### Calculate the median for each bootstrap sample
r.median <- sapply(resamples, median)
r.median
## [1] 5.0 5.0 5.0 4.0 5.0 5.0 5.0 4.5 4.5 5.0 4.0 5.0 5.0 4.0 5.0 4.0 5.0 5.0 5.0
## [20] 5.0 5.0 5.0 5.0 5.0 4.0
### Compute summary statistics for the distribution of the medians
summary(r.median)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 4.50 5.00 4.76 5.00 5.00
### Display a histogram of the distribution of the medians
hist(r.median)
abline(v=mean(data), lty=3)
abline(v=mean(r.median), lty=1)
### Display a normal probability plot
qqnorm(r.median)
qqline(r.median)
### Display a quantile-quantile plot using a standard normal
n <- length(r.median)
qqplot(qnorm(ppoints(n)), r.median,
main = "Q-Q plot for Normal")
qqline(r.median, distribution = function(p){qnorm(p)},
probs = c(0.25, 0.75), col = 2)
### Display a quantile-quantile plot using a chiquare
n <- length(r.median)
qqplot(qchisq(ppoints(n), df = n-1), r.median,
main = expression("Q-Q plot for" ~~ {chi^2}[nu == n-1]))
qqline(r.median, qtype = 5, distribution = function(p){qchisq(p, df = n-1)},
probs = c(0.1, 0.6), col = 2)
mtext("qqline(*, dist = qchisq(., df=n-1), prob = c(0.1, 0.6))")
These steps can be combined into a single function where all we would need to specify is which data set to use and how many times we want to resample in order to obtain the adjusted standard error of the median.
### Function to bootstrap the mean, standard error, and bias of the median
b.median <- function(data, nboot=9999) {
resamples <- lapply(1:nboot, function(i) sample(data, replace=T))
r.median <- sapply(resamples, median)
d.xbar <- mean(data)
r.xbar <- mean(r.median)
r.bias <- r.xbar - d.xbar
r.stderr <- sqrt(var(r.median))
list(xbar=r.xbar, std.err=r.stderr, bias=r.bias, resamples=resamples,
medians=r.median
)
}
### Create some data to be used (same as in the above example)
data1 <- round(rnorm(100, 5, 3))
### Save the results of 10000 boots of the function b.median in the object b1
b1 <- b.median(data1, 10000)
### Display the third of the 10000 bootstrap samples
b1$resamples[3]
## [[1]]
## [1] 8 4 0 6 4 3 7 9 9 9 4 8 9 8 4 8 8 5 3 4 4 3 2 6 5
## [26] 8 8 2 3 8 3 9 13 3 4 6 4 3 2 5 6 6 8 3 8 0 5 0 0 11
## [51] 0 5 7 10 0 7 7 11 3 6 2 1 7 6 4 4 4 7 8 7 4 -1 7 2 4
## [76] 5 11 8 5 8 2 2 6 11 4 0 0 3 2 3 5 13 0 2 6 -3 5 8 8 3
### Display the mean, standard error, and bias of bootstrapped medians
print(b1$xbar, b1$std.err, b1$bias)
## [1] 5
### Display a histogram of the distribution of medians
hist(b1$medians)
### Make a normal probability plot
qqnorm(b1$medians)
qqline(b1$medians)
### We can input the data directly into the function and display
### the standard error in one line of code.
b.median(data1)$std.err
## [1] 0.5794808
### Or, we can make the histogram in one line
hist(b.median(data1)$medians)
Note that the single line calls generate plots based upon different bootstrap samples. This approach is also not very time efficient.
It would be fairly simple to generalize the function to work for any summary statistic. We show this approach for the sample mean.
### Function to bootstrap the mean, standard error, and bias of a statistic
b.stat <- function(data, nboot=9999, fnct = mean, ...) {
resamples <- lapply(1:nboot, function(i) sample(data, replace=T))
r.boots <- sapply(resamples, fnct, ...)
d.xbar <- mean(data)
r.xbar <- mean(r.boots)
r.bias <- r.xbar - d.xbar
r.stderr <- sqrt(var(r.boots))
list(xbar=r.xbar, std.err=r.stderr, bias=r.bias, resamples=resamples,
boots=r.boots
)
}
### Reuse the data created above so that we can compare mean and median
### Save the results of 10000 boots of the function b.median in the object b2
b2 <- b.stat(data1, 10000, function(x){mean(x)})
### Display the third of the 10000 bootstrap samples
b2$resamples[3]
## [[1]]
## [1] 4 8 5 8 10 6 7 5 5 5 8 8 8 5 -1 2 7 11 6 6 13 6 7 7 6
## [26] 10 6 3 0 8 5 3 7 8 7 8 7 4 9 4 4 2 -3 5 4 11 9 13 3 7
## [51] 6 11 3 3 4 8 3 8 8 5 4 7 6 3 8 4 5 9 9 3 4 0 4 4 4
## [76] 11 3 5 2 3 8 6 3 7 8 5 4 9 7 5 11 5 8 9 4 8 4 5 5 8
### Display the mean, standard error, and bias of bootstrapped means
print(round(c(b2$xbar, b2$std.err, b2$bias), 4))
## [1] 5.5179 0.2995 -0.0021
### Display a histogram of the distribution of means
hist(b2$boots)
abline(v = b2$xbar, lty = 1)
abline(v = mean(data1), lty = 3)
### Make a normal probability plot
qqnorm(b2$boots)
qqline(b2$boots)
There exists a package, boot, that already has a bootstrapping function that acts like the function that we created in the code chunk above. The advantage of using boot is that it has some additional methods of doing things.
### Bootstrap of the median of the data1 data from above.
### Make sure that the boot package is installed using
### install.packages("boot"), or use a package like pacman to take care
### of installation and loading.
#library(boot)
p_load(boot)
### Define the median function with data, d, and boot sample indices, i.
mystat <- function(d, i){
median(d[i])
}
### Use the boot function to run the bootstrap
b3 <- boot(data1, mystat, R=9999)
b3
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = data1, statistic = mystat, R = 9999)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 5.5 -0.0190019 0.5745912
plot(b3)
### Display a quantile-quantile plot using a chiquare
b3$t[1:15] ### The boot statistics are stored in a variable, t, within the the boot object.
## [1] 5 5 5 6 6 5 6 6 4 5 6 5 5 6 6
n <- length(b3$t)
qqplot(qchisq(ppoints(n), df = n-1), b3$t,
main = expression("Q-Q plot for" ~~ {chi^2}[nu == n-1]))
qqline(b3$t, qtype = 5, distribution = function(p){qchisq(p, df = n-1)},
probs = c(0.1, 0.6), col = 2)
mtext("qqline(*, dist = qchisq(., df=n-1), prob = c(0.1, 0.6))")
More complex statistics can be bootstrapped. We look at the mean ratio of weight to height for college students.
### Get the data
htwt <- read.csv("http://facweb1.redlands.edu/fac/jim_bentley/downloads/math111/htwt.csv")
head(htwt)
## Height Weight Group
## 1 64 159 1
## 2 63 155 2
## 3 67 157 2
## 4 60 125 1
## 5 52 103 2
## 6 58 122 2
### Usual bootstrap of the ratio of means
ratio <- function(d, i){ mean(d$Weight[i] / d$Height[i]) }
b4 <- boot(htwt, ratio, R = 9999)
b4
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = htwt, statistic = ratio, R = 9999)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 2.20046 -0.0002817692 0.08496458
plot(b4)
### Display a quantile-quantile plot using a gamma --- which it probably isn't
b4$t[1:15] ### The boot statistics are stored in a variable, t, within the the boot object.
## [1] 2.135591 2.229664 2.316977 2.228467 2.372114 2.210358 2.200328 2.172293
## [9] 2.168086 2.060898 2.134306 2.185192 2.188948 2.106805 2.189907
n <- length(b4$t)
s <- var(b4$t)/mean(b4$t) ### Estimate the scale parameter
a <- mean(b4$t)/s ### Estimate the shape
qqplot(qgamma(ppoints(n), shape = a, scale = s), b4$t,
main = paste0("Q-Q plot for Gamma(shape = ", round(a,2), ", scale = ", round(s,4), ")")
)
qqline(b4$t, qtype = 5, distribution = function(p){qgamma(p, shape = a, scale = s)},
probs = c(0.1, 0.6), col = 2)